Package: sklearn.preprocessing
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn. They might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
The function scale provides a quick and easy way to perform this operation on a single array-like dataset:
In [2]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X)
X_scaled
Out[2]:
In [3]:
X_scaled.mean(axis=0)
Out[3]:
In [4]:
X_scaled.std(axis=0)
Out[4]:
The utility class StandardScaler implements the Transformer API to compute the mean and standard deviation on a training set, so that the same transformation can later be reapplied to the test set.
In [5]:
scaler = preprocessing.StandardScaler().fit(X)
scaler
Out[5]:
In [6]:
scaler.mean_
Out[6]:
In [7]:
scaler.scale_
Out[7]:
In [8]:
scaler.transform(X)
Out[8]:
In [9]:
scaler.transform([[-1., 1., 0.]])
Out[9]:
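The transformation learned by StandardScaler is simply subtraction of the stored mean_ followed by division by scale_. As a quick sanity check, a minimal sketch (reusing the X and scaler objects above) that reproduces scaler.transform(X) by hand:
X_manual = (X - scaler.mean_) / scaler.scale_  # manual standardization with the fitted attributes
np.allclose(X_manual, scaler.transform(X))     # expected: True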
In [10]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
Out[10]:
In [11]:
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
Out[11]:
In [12]:
min_max_scaler.scale_
Out[12]:
In [13]:
min_max_scaler.min_
Out[13]:
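Internally, MinMaxScaler rescales each feature to the default [0, 1] range as (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)); the scale_ and min_ attributes shown above encode exactly this mapping, so transform is equivalent to X * scale_ + min_. A minimal sketch in plain NumPy, reusing X_train and X_train_minmax from above:
X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
np.allclose(X_std, X_train_minmax)                           # expected: True
np.allclose(X_train * min_max_scaler.scale_ + min_max_scaler.min_,
            X_train_minmax)                                  # expected: True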
MaxAbsScaler works in a very similar fashion, but scales the training data so that it lies within the range [-1, 1] by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero or for sparse data.
MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this.
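A minimal sketch of how MaxAbsScaler would be used on the toy data above (variable names here are illustrative):
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)  # each feature divided by its maximum absolute value
max_abs_scaler.scale_                                    # per-feature maximum absolute values learned from X_train
max_abs_scaler.transform(np.array([[-3., -1., 4.]]))     # the same scaling reapplied to unseen data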
...
If your data contains many outliers, scaling using the mean and variance of the data is likely not to work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead.
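RobustScaler follows the same Transformer API but centers each feature on its median and scales by its interquartile range; a minimal sketch, reusing X_train from above:
robust_scaler = preprocessing.RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)  # (X - median) / IQR, per feature
robust_scaler.center_                                   # per-feature medians of the training data
robust_scaler.scale_                                    # per-feature interquartile ranges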
...
...
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1 or l2 norm:
In [16]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized
Out[16]:
normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.
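The Normalizer utility class implements the same operation through the Transformer API (fit does nothing here, since each sample is normalized independently of the others); a short sketch, reusing the X list above:
normalizer = preprocessing.Normalizer().fit(X)  # fit is a no-op for this stateless transformer
normalizer.transform(X)                          # row-wise l2 normalization, same result as normalize(X)
normalizer.transform([[-1., 1., 0.]])            # can be applied to new samples as well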
Feature binarization is the process of thresholding numerical features to get boolean values. ...
As with the Normalizer, the utility class Binarizer is meant to be used in the early stages of sklearn.pipeline.Pipeline.
In [17]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer
Out[17]:
In [18]:
binarizer.transform(X)
Out[18]:
In [19]:
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
Out[19]:
The preprocessing module provides a companion function binarize to be used when the transformer API is not necessary.
binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.
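A brief sketch of binarize on the same X and threshold used above:
preprocessing.binarize(X)                 # same result as Binarizer().transform(X)
preprocessing.binarize(X, threshold=1.1)  # the threshold argument behaves as it does for the class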
Categorical features are often encoded as integers; such an integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.
One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
In [21]:
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
Out[21]:
In [22]:
enc.transform([[0, 1, 3]]).toarray()
Out[22]:
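If the data to transform may contain category values never seen during fit, the encoder can be told to ignore them rather than raise an error; a sketch, assuming a scikit-learn version in which handle_unknown='ignore' is supported:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[2, 1, 3]]).toarray()  # the unseen value 2 in the first feature encodes to all zeros for that block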